feat(gateway): add TOML configuration file (RFC 0003)#1317
Conversation
|
Label |
bf3670f to
02df5f7
Compare
drew
left a comment
There was a problem hiding this comment.
Ran this through codex, these seem reasonable to consider. "Restart pods when the rendered ConfigMap changes" feels possibly out of scope.
Review Comments
[P1] Move Helm grpc_endpoint into a supported table
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/gateway-config.yaml:36
The generated Helm config always writes grpc_endpoint under [openshell.gateway], but GatewayFileSection uses deny_unknown_fields and has no grpc_endpoint field. Default chart installs will fail during config_file::load before the gateway starts; this needs to move under
[openshell.drivers.kubernetes] or be added to the gateway schema and merge path.
[P1] Keep Kubernetes imagePullPolicy empty by default
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-driver-kubernetes/src/config.rs:76
When no Kubernetes image pull policy is configured, this now defaults to missing, which is the Podman vocabulary and is not a valid Kubernetes imagePullPolicy. Because the driver emits the field whenever it is non-empty, default Kubernetes deployments will create pods rejected by the
API; keep this default empty so Kubernetes applies its own default.
[P1] Preserve driver-table gRPC endpoints
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/lib.rs:508
When a TOML file sets [openshell.drivers.kubernetes].grpc_endpoint, kubernetes_config_from_file reads it and this assignment immediately overwrites it with config.grpc_endpoint, which can only come from CLI/env in this patch. File-only Kubernetes configs therefore pass an empty
OPENSHELL_ENDPOINT into sandboxes, so the callback path breaks even though the driver table contains the endpoint.
[P2] Keep disableTls wired after moving config to TOML
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/statefulset.yaml:49
When .Values.server.disableTls is true, the ConfigMap omits the TLS section, but this args/env block no longer passes --disable-tls or OPENSHELL_DISABLE_TLS. RunArgs.disable_tls remains false, so startup still requires TLS cert/key/CA paths and the plaintext Helm option fails.
[P2] Restart pods when the rendered ConfigMap changes
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/statefulset.yaml:65
After moving settings such as log level, sandbox image, TLS paths, and driver options into this ConfigMap, Helm upgrades that change those values no longer change the StatefulSet pod template. The gateway only reads /etc/openshell/gateway.toml at startup, so without a checksum
annotation or similar rollout trigger, pods keep running with stale config.
[P2] Honor health and metrics listener addresses
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/cli.rs:511
For config files that set health or metrics on a different interface than the main listener, these lines copy only the port and later build the listener with args.bind_address. For example, a health listener intended for loopback can be exposed on the public bind address; preserve the
full SocketAddr from the TOML file or validate that separate IPs are unsupported.
[P2] Apply ssh_session_ttl_secs from the TOML file
/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/config_file.rs:101
The loader accepts ssh_session_ttl_secs, but merge_file_into_args and run_from_args never apply it to openshell_core::Config with with_ssh_session_ttl_secs. Operators can set the documented key and get a successful parse while session tokens still use the built-in default TTL.
|
Sounds good, agree these look reasonable to me. |
Introduces an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a TOML file with gateway-wide settings and per-driver tables. Source precedence is CLI > env > file > built-in default, implemented via clap's ValueSource so existing flags and env vars keep their priority. Driver crates (kubernetes, docker, podman, vm) now derive Deserialize on their config structs. SupervisorSideloadMethod gains Deserialize with kebab-case rename. A per-driver inheritance allowlist on the loader side overlays [openshell.gateway] shared defaults (default_image, supervisor_image, image_pull_policy, guest_tls_*, ssh_handshake_skew_secs, client_tls_secret_name, host_gateway_ip, enable_user_namespaces) onto each [openshell.drivers.<name>] table before deserialization. The Helm chart renders a new gateway-config ConfigMap and mounts it at /etc/openshell/gateway.toml. The migrated OPENSHELL_* env entries are dropped from the StatefulSet — only the Secret-backed OPENSHELL_SSH_HANDSHAKE_SECRET remains. database_url stays on --db-url. Adds examples/gateway/gateway.example.toml and updates architecture/gateway.md with the source precedence and inheritance rules.
…from examples Both fields are scheduled for removal. Remove the example values and the env-only note so the gateway.toml example and the architecture doc stop recommending settings that will not exist much longer.
Adds focused single-driver examples next to the comprehensive gateway.example.toml: kubernetes, docker, podman, and microvm. Each one demonstrates the realistic settings for that driver plus how shared [openshell.gateway] defaults inherit into the driver table. A new unit test (`checked_in_examples_parse`) loads every example through the config_file loader so schema drift fails CI rather than silently shipping a broken example.
Kubernetes and Podman use mutually-incompatible vocabularies for the same
TOML key:
- Kubernetes: `Always | IfNotPresent | Never` (free-form string passed
verbatim to the K8s API).
- Podman: `always | missing | never | newer` (strict lowercase enum
deserialised into `ImagePullPolicy`).
No value means the same thing in both drivers. Sharing the key at
`[openshell.gateway]` scope and inheriting it into every active driver's
table meant any value safe for one driver was either wrong or silently
dropped for the other (`IfNotPresent` → `ImagePullPolicy::Missing` after
`.unwrap_or_default()`). Operators run one driver per gateway, so the
"shared default" never pays for itself.
Make `image_pull_policy` driver-local:
- Remove the field from `GatewayFileSection` and from
`inheritable_keys()` for both Kubernetes and Podman.
- Drop the file→`RunArgs` merge for the gateway-scope key.
- Stop unconditionally clobbering the driver value with
`config.sandbox_image_pull_policy` in the runtime wiring — only apply
the CLI/env override when it was set (and, for Podman, only when it
parses into the lowercase enum).
- Move the key under `[openshell.drivers.kubernetes]` and
`[openshell.drivers.podman]` in every example, the RFC, the
architecture doc, and the Helm-rendered gateway ConfigMap.
The supervisor pull policy follows the same shape: it is K8s-only and
moves into `[openshell.drivers.kubernetes]` alongside `image_pull_policy`
in the Helm template.
02df5f7 to
9f19b02
Compare
Resolves the P1 and P2 issues raised in PR #1317: - Helm gateway ConfigMap moves `grpc_endpoint` under `[openshell.drivers.kubernetes]` so the default install no longer fails the gateway's `deny_unknown_fields` schema check. - `kubernetes_config_from_file` and `podman_config_from_file` only let the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table value when it was actually supplied, preserving file-only configs. - Kubernetes driver default `image_pull_policy` is now empty (was Podman vocabulary "missing"), so default deployments let the Kubernetes API apply its own policy instead of being rejected. - New `disable_tls` gateway field plumbs `.Values.server.disableTls` through the TOML ConfigMap instead of relying on env vars dropped from the StatefulSet. - StatefulSet pod template now carries a `checksum/gateway-config` annotation so `helm upgrade` rolls pods when the ConfigMap changes. - Auxiliary listener resolution preserves the full `SocketAddr` from `health_bind_address` / `metrics_bind_address`, so a loopback-pinned health port is not silently relocated onto the public bind address. - `ssh_session_ttl_secs` from the file is now applied to `Config` (it was previously accepted by the loader but never read). New regression coverage: cli-level merge tests for the new fields plus helm-unittest assertions for the ConfigMap shape, checksum annotation, and `disable_tls` rendering.
Replaces the per-driver example files under examples/gateway/ with a single published reference page at docs/reference/gateway-config.mdx covering source precedence, layout, the full example, and the four per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged during PR #1317 review that the examples belong with the user-facing docs rather than in a sibling examples/ directory. The cross-references in architecture/gateway.md and RFC 0003 are updated to point at the new docs page; the round-trip test in config_file.rs is removed (schema coverage stays on the inline parses_full_example test and per-field merge tests — doc-snippet drift belongs in a separate docs-lint, not in a cross-tree Rust unit test).
|
🌿 Preview your docs: https://nvidia-preview-pr-1317.docs.buildwithfern.com/openshell |
Summary
Implements RFC 0003: an opt-in
--config/OPENSHELL_GATEWAY_CONFIGflag that loads a TOML file with gateway-wide settings and per-driver tables. Source precedence staysCLI > env > file > built-in default, so existing flags andOPENSHELL_*env vars are not affected when no file is supplied. Includes the matching HelmConfigMapcutover.Related Issue
N/A — implements RFC 0003.
Changes
crates/openshell-server/src/config_file.rs): TOML parser withdeny_unknown_fieldseverywhere, version-1 enforcement, env-only rejection ofdatabase_url/ssh_handshake_secret, and adriver_table()that overlays inheritable[openshell.gateway]defaults onto each[openshell.drivers.<name>]table.crates/openshell-server/src/cli.rs): new--configflag;merge_file_into_args()applies file values only when clap'sValueSource::DefaultValueindicates an arg was not supplied; newbuild_vm_config()/build_docker_config()consume merged driver tables.kubernetes,docker,podman,vm): config structs deriveDeserialize/Serializewith struct-level#[serde(default, deny_unknown_fields)].SupervisorSideloadMethodgainsDeserializewithkebab-caserename. RefreshedDefaultimpls where needed.run_server/build_compute_runtimethreadOption<&ConfigFile>through; the ad-hoc env reads for Kubernetes/Podman supervisor settings now fold into the file-driven pipeline (env still wins).templates/gateway-config.yamlConfigMaprenderinggateway.tomlfrom existing.Values.server.*/.Values.service.*/.Values.supervisor.*keys;StatefulSetmounts it at/etc/openshell/gateway.toml, passes--config, and drops the 18 migratedOPENSHELL_*env entries. OnlyOPENSHELL_SSH_HANDSHAKE_SECRET(Secret-backed) remains.architecture/gateway.mddocuments source precedence and the inheritance allowlist;examples/gateway/gateway.example.tomlships a worked example.default_image,supervisor_image,image_pull_policy,client_tls_secret_name,host_gateway_ip,enable_user_namespaces,guest_tls_ca/cert/key.ssh_handshake_skew_secs/ssh_handshake_secretreferences from examples and architecture ahead of their planned removal.Testing
mise run pre-commitpassesconfig_file.rs(parse, secret rejection, version, unknown field, inheritance merge, override) and 6 new tests incli.rs(file value applies when defaulted, CLI overrides file, env overrides file, OIDC block, driver inheritance, driver override). Fullmise run testpasses.test:e2elabel applied; the sandbox-infra path should be exercised against the new--configstartup mode.Manual smoke:
helm lint .clean,helm template .rendersgateway.tomlwith TLS, with TLS disabled, and with OIDC enabled.database_url,ssh_handshake_secret,version = 2, and unknown fields with friendly errors.Checklist
Deferred (explicit follow-ups)
values.yaml.--config-dirdirectory loader (RFC open question Prototype data access layer that can connect to outlook data #2).